Database adventures

2013-06-30 18:53:17

(Warning--technical!)

I've been using this WordPress blog for some time now, and I've been wanting to learn more about how it works. This is partly to have a better sense of control over it. But I wanted to solve one specific problem: How to obtain, in a useful text format, all of the posts I've written, for the purposes of archiving them. Here's what I've found out.

It turns out that WordPress is (as best as I understand) not much more than a very glossy wrapper for a database, which does the hard work of storing all the posts and comments, and serving them in whatever order the web site user requests (e.g., show all blog posts on the subject of 'science'). The database software is a popular open-source database called MySQL. To create a WordPress site, you must rely on your site administrator to set up the database for you; all you need to know is the name and location of the database. The blog administrator (me) can access the database to do basic tasks, through a web-based interface (MyPhpAdmin) provided by the hosting service of the web site. One task is to back up the database itself, which can be downloaded as a special .sql file. This turns out to be a text file, and if you hunt, you'll find the text of all of your posts, etc. But it's not meant for humans to read or use. So I wanted to find out how to open up the file as a database myself, and get at the texts for the posts.

This turned out to be an interesting adventure. First, I had to acquire my own copy of MySQL, to run on my own computer (a MacBook Pro). This turned out to be not too difficult: go to MySQL website, download the appropriate file, which is set up to install itself. Then I had to figure out how to get MySQL to run. This was a bit more tricky. It turns out you need to create a user account for MySQL, and you must give the user certain permissions to enable creating or reading a database. Finally, I was able to open and work with the database from WordPress; the database .sql file is actually a set of commands, with accompanying data, for reconstructing the database. This file executed, and created the database. I was then able to enter commands to read the database. These commands were "SQL queries," which can be used to do such things as select and output all blog posts on the topic 'science,' or from a certain date. (SQL stands for "Structured Query Language.")

One other thing I did: I learned how to communicate with MySQL in a program written in the programming language R; this turns out to be fairly easy. (This can also be done without much trouble in languages such as Java or C++.) This is very useful, since it enables much more complex tasks to be completed, such as grab blog posts and output them in just the format desired. I used R to grab all blog posts and output them as a simple (HTML-formatted) web page. I should mention that MySQL is very popular, but a much simpler database that is also very popular is SQLite; I learned how to set up and use that in R as well. (By the way, the website for SQLite boasts about what organizations use it. They include in the list of users of their database the malware 'Flame.' This is apparently a military-grade program possibly deployed by the US or Israel.)

Though all of this, I learned one important thing: What, exactly, a "relational database" (a typical database) actually is. It's nothing more than a set of spreadsheet-like "tables" giving data in columns, for each of however many items, one item per row. In the case of WordPress, the database for a blog has 11 tables; these include a table containing all of the blog posts, and a table containing all blog comments. A better example to explain what a relational database is might be a database for a medical research project, where various subjects are given multiple tests with a medical device. This database might have two tables: One for patients, and one for test sessions. The first table would list for each patient data such as age, gender, and diagnosis. The second table would list for each test the patient, the date of the test, and the results of the test. This arrangement is efficient (so if you only had one table, for test sessions say, you would have to repeat for each test the information for the subject). It is also easy to work with: by avoiding repetition of data items, it is easier to change data in the table if need be.

What I still have to learn is how to use SQL in detail; I've only briefly played around with it. To that end, I found a nice little tutorial (Teach Yourself SQL in 10 Minutes, by Ben Forta). When I have time, or when I really need to learn more about this tool, I'll be able to learn this more thoroughly. But I have already accomplished one thing: I know a lot more about the basics of databases and how to use them.